ACL.2021 - Student Research Workshop

Total: 36

#1 Investigation on Data Adaptation Techniques for Neural Named Entity Recognition [PDF] [Copy] [Kimi]

Authors: Evgeniia Tokarchuk ; David Thulke ; Weiyue Wang ; Christian Dugast ; Hermann Ney

Data processing is an important step in various natural language processing tasks. As the commonly used datasets in named entity recognition contain only a limited number of samples, it is important to obtain additional labeled data in an efficient and reliable manner. A common practice is to utilize large monolingual unlabeled corpora. Another popular technique is to create synthetic data from the original labeled data (data augmentation). In this work, we investigate the impact of these two methods on the performance of three different named entity recognition tasks.

#2 Stage-wise Fine-tuning for Graph-to-Text Generation [PDF] [Copy] [Kimi1]

Authors: Qingyun Wang ; Semih Yavuz ; Xi Victoria Lin ; Heng Ji ; Nazneen Rajani

Graph-to-text generation has benefited from pre-trained language models (PLMs) in achieving better performance than structured graph encoders. However, they fail to fully utilize the structure information of the input graph. In this paper, we aim to further improve the performance of the pre-trained language model by proposing a structured graph-to-text model with a two-step fine-tuning mechanism which first fine-tunes model on Wikipedia before adapting to the graph-to-text generation. In addition to using the traditional token and position embeddings to encode the knowledge graph (KG), we propose a novel tree-level embedding method to capture the inter-dependency structures of the input graph. This new approach has significantly improved the performance of all text generation metrics for the English WebNLG 2017 dataset.

#3 Transformer-Based Direct Hidden Markov Model for Machine Translation [PDF] [Copy] [Kimi1]

Authors: Weiyue Wang ; Zijian Yang ; Yingbo Gao ; Hermann Ney

The neural hidden Markov model has been proposed as an alternative to attention mechanism in machine translation with recurrent neural networks. However, since the introduction of the transformer models, its performance has been surpassed. This work proposes to introduce the concept of the hidden Markov model to the transformer architecture, which outperforms the transformer baseline. Interestingly, we find that the zero-order model already provides promising performance, giving it an edge compared to a model with first-order dependency, which performs similarly but is significantly slower in training and decoding.

#4 AutoRC: Improving BERT Based Relation Classification Models via Architecture Search [PDF] [Copy] [Kimi1]

Author: Wei Zhu

Although BERT based relation classification (RC) models have achieved significant improvements over the traditional deep learning models, it seems that no consensus can be reached on what is the optimal architecture, since there are many design choices available. In this work, we design a comprehensive search space for BERT based RC models and employ a modified version of efficient neural architecture search (ENAS) method to automatically discover the design choices mentioned above. Experiments on eight benchmark RC tasks show that our method is efficient and effective in finding better architectures than the baseline BERT based RC models. Ablation study demonstrates the necessity of our search space design and the effectiveness of our search method. We also show that our framework can also apply to other entity related tasks like coreference resolution and span based named entity recognition (NER).

#5 How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages [PDF] [Copy] [Kimi1]

Authors: Rachit Bansal ; Himanshu Choudhary ; Ravneet Punia ; Niko Schenk ; Émilie Pagé-Perron ; Jacob Dahl

Despite the recent advancements of attention-based deep learning architectures across a majority of Natural Language Processing tasks, their application remains limited in a low-resource setting because of a lack of pre-trained models for such languages. In this study, we make the first attempt to investigate the challenges of adapting these techniques to an extremely low-resource language – Sumerian cuneiform – one of the world’s oldest written languages attested from at least the beginning of the 3rd millennium BC. Specifically, we introduce the first cross-lingual information extraction pipeline for Sumerian, which includes part-of-speech tagging, named entity recognition, and machine translation. We introduce InterpretLR, an interpretability toolkit for low-resource NLP and use it alongside human evaluations to gauge the trained models. Notably, all our techniques and most components of our pipeline can be generalised to any low-resource language. We publicly release all our implementations including a novel data set with domain-specific pre-processing to promote further research in this domain.

#6 On the Relationship between Zipf’s Law of Abbreviation and Interfering Noise in Emergent Languages [PDF] [Copy] [Kimi1]

Authors: Ryo Ueda ; Koki Washio

This paper studies whether emergent languages in a signaling game follow Zipf’s law of abbreviation (ZLA), especially when the communication ability of agents is limited because of interfering noises. ZLA is a well-known tendency in human languages where the more frequently a word is used, the shorter it will be. Surprisingly, previous work demonstrated that emergent languages do not obey ZLA at all when neural agents play a signaling game. It also reported that a ZLA-like tendency appeared by adding an explicit penalty on word lengths, which can be considered some external factors in reality such as articulatory effort. We hypothesize, on the other hand, that there might be not only such external factors but also some internal factors related to cognitive abilities. We assume that it could be simulated by modeling the effect of noises on the agents’ environment. In our experimental setup, the hidden states of the LSTM-based speaker and listener were added with Gaussian noise, while the channel was subject to discrete random replacement. Our results suggest that noise on a speaker is one of the factors for ZLA or at least causes emergent languages to approach ZLA, while noise on a listener and a channel is not.

#7 Long Document Summarization in a Low Resource Setting using Pretrained Language Models [PDF] [Copy] [Kimi1]

Authors: Ahsaas Bajaj ; Pavitra Dangati ; Kalpesh Krishna ; Pradhiksha Ashok Kumar ; Rheeya Uppaal ; Bradford Windsor ; Eliot Brenner ; Dominic Dotterrer ; Rajarshi Das ; Andrew McCallum

Abstractive summarization is the task of compressing a long document into a coherent short document while retaining salient information. Modern abstractive summarization methods are based on deep neural networks which often require large training datasets. Since collecting summarization datasets is an expensive and time-consuming task, practical industrial settings are usually low-resource. In this paper, we study a challenging low-resource setting of summarizing long legal briefs with an average source document length of 4268 words and only 120 available (document, summary) pairs. To account for data scarcity, we used a modern pre-trained abstractive summarizer BART, which only achieves 17.9 ROUGE-L as it struggles with long documents. We thus attempt to compress these long documents by identifying salient sentences in the source which best ground the summary, using a novel algorithm based on GPT-2 language model perplexity scores, that operates within the low resource regime. On feeding the compressed documents to BART, we observe a 6.0 ROUGE-L improvement. Our method also beats several competitive salience detection baselines. Furthermore, the identified salient sentences tend to agree with independent human labeling by domain experts.

#8 Attending Self-Attention: A Case Study of Visually Grounded Supervision in Vision-and-Language Transformers [PDF] [Copy] [Kimi1]

Authors: Jules Samaran ; Noa Garcia ; Mayu Otani ; Chenhui Chu ; Yuta Nakashima

The impressive performances of pre-trained visually grounded language models have motivated a growing body of research investigating what has been learned during the pre-training. As a lot of these models are based on Transformers, several studies on the attention mechanisms used by the models to learn to associate phrases with their visual grounding in the image have been conducted. In this work, we investigate how supervising attention directly to learn visual grounding can affect the behavior of such models. We compare three different methods on attention supervision and their impact on the performances of a state-of-the-art visually grounded language model on two popular vision-and-language tasks.

#9 Video-guided Machine Translation with Spatial Hierarchical Attention Network [PDF] [Copy] [Kimi1]

Authors: Weiqi Gu ; Haiyue Song ; Chenhui Chu ; Sadao Kurohashi

Video-guided machine translation, as one type of multimodal machine translations, aims to engage video contents as auxiliary information to address the word sense ambiguity problem in machine translation. Previous studies only use features from pretrained action detection models as motion representations of the video to solve the verb sense ambiguity, leaving the noun sense ambiguity a problem. To address this problem, we propose a video-guided machine translation system by using both spatial and motion representations in videos. For spatial features, we propose a hierarchical attention network to model the spatial information from object-level to video-level. Experiments on the VATEX dataset show that our system achieves 35.86 BLEU-4 score, which is 0.51 score higher than the single model of the SOTA method.

#10 Stylistic approaches to predicting Reddit popularity in diglossia [PDF] [Copy] [Kimi1]

Author: Huikai Chua

Past work investigating what makes a Reddit post popular has indicated that style is a far better predictor than content, where posts conforming to a subreddit’s community style are better received. However, what about a diglossia, when there are two community styles? In Singapore, the basilect (‘Singlish’) co-exists with an acrolect (standard English), each with contrasting advantages of community identity and prestige respectively. In this paper, I apply stylistic approaches to predicting Reddit post scores in a diglossia. Using data from the Singaporean and British subreddits, I show that while the acrolect’s prestige attracts more upvotes, the most popular posts also draw on Singlish vocabulary to appeal to the community identity.

#11 “I’ve Seen Things You People Wouldn’t Believe”: Hallucinating Entities in GuessWhat?! [PDF] [Copy] [Kimi1]

Authors: Alberto Testoni ; Raffaella Bernardi

Natural language generation systems have witnessed important progress in the last years, but they are shown to generate tokens that are unrelated to the source input. This problem affects computational models in many NLP tasks, and it is particularly unpleasant in multimodal systems. In this work, we assess the rate of object hallucination in multimodal conversational agents playing the GuessWhat?! referential game. Better visual processing has been shown to mitigate this issue in image captioning; hence, we adapt to the GuessWhat?! task the best visual processing models at disposal, and propose two new models to play the Questioner agent. We show that the new models generate few hallucinations compared to other renowned models available in the literature. Moreover, their hallucinations are less severe (affect task-accuracy less) and are more human-like. We also analyse where hallucinations tend to occur more often through the dialogue: hallucinations are less frequent in earlier turns, cause a cascade hallucination effect, and are often preceded by negative answers, which have been shown to be harder to ground.

#12 How do different factors Impact the Inter-language Similarity? A Case Study on Indian languages [PDF] [Copy] [Kimi]

Authors: Sourav Kumar ; Salil Aggarwal ; Dipti Misra Sharma ; Radhika Mamidi

India is one of the most linguistically diverse nations of the world and is culturally very rich. Most of these languages are somewhat similar to each other on account of sharing a common ancestry or being in contact for a long period of time. Nowadays, researchers are constantly putting efforts in utilizing the language relatedness to improve the performance of various NLP systems such as cross lingual semantic search, machine translation, sentiment analysis systems, etc. So in this paper, we performed an extensive case study on similarity involving languages of the Indian subcontinent. Language similarity prediction is defined as the task of measuring how similar the two languages are on the basis of their lexical, morphological and syntactic features. In this study, we concentrate only on the approach to calculate lexical similarity between Indian languages by looking at various factors such as size and type of corpus, similarity algorithms, subword segmentation, etc. The main takeaways from our work are: (i) Relative order of the language similarities largely remain the same, regardless of the factors mentioned above, (ii) Similarity within the same language family is higher, (iii) Languages share more lexical features at the subword level.

#13 COVID-19 and Misinformation: A Large-Scale Lexical Analysis on Twitter [PDF] [Copy] [Kimi1]

Authors: Dimosthenis Antypas ; Jose Camacho-Collados ; Alun Preece ; David Rogers

Social media is often used by individuals and organisations as a platform to spread misinformation. With the recent coronavirus pandemic we have seen a surge of misinformation on Twitter, posing a danger to public health. In this paper, we compile a large COVID-19 Twitter misinformation corpus and perform an analysis to discover patterns with respect to vocabulary usage. Among others, our analysis reveals that the variety of topics and vocabulary usage are considerably more limited and negative in tweets related to misinformation than in randomly extracted tweets. In addition to our qualitative analysis, our experimental results show that a simple linear model based only on lexical features is effective in identifying misinformation-related tweets (with accuracy over 80%), providing evidence to the fact that the vocabulary used in misinformation largely differs from generic tweets.

#14 Situation-Based Multiparticipant Chat Summarization: a Concept, an Exploration-Annotation Tool and an Example Collection [PDF] [Copy] [Kimi]

Authors: Anna Smirnova ; Evgeniy Slobodkin ; George Chernishev

Currently, text chatting is one of the primary means of communication. However, modern text chat still in general does not offer any navigation or even full-featured search, although the high volumes of messages demand it. In order to mitigate these inconveniences, we formulate the problem of situation-based summarization and propose a special data annotation tool intended for developing training and gold-standard data. A situation is a subset of messages revolving around a single event in both temporal and contextual senses: e.g, a group of friends arranging a meeting in chat, agreeing on date, time, and place. Situations can be extracted via information retrieval, natural language processing, and machine learning techniques. Since the task is novel, neither training nor gold-standard datasets for it have been created yet. In this paper, we present the formulation of the situation-based summarization problem. Next, we describe Chat Corpora Annotator (CCA): the first annotation system designed specifically for exploring and annotating chat log data. We also introduce a custom query language for semi-automatic situation extraction. Finally, we present the first gold-standard dataset for situation-based summarization. The software source code and the dataset are publicly available.

#15 Modeling Text using the Continuous Space Topic Model with Pre-Trained Word Embeddings [PDF] [Copy] [Kimi1]

Authors: Seiichi Inoue ; Taichi Aida ; Mamoru Komachi ; Manabu Asai

In this study, we propose a model that extends the continuous space topic model (CSTM), which flexibly controls word probability in a document, using pre-trained word embeddings. To develop the proposed model, we pre-train word embeddings, which capture the semantics of words and plug them into the CSTM. Intrinsic experimental results show that the proposed model exhibits a superior performance over the CSTM in terms of perplexity and convergence speed. Furthermore, extrinsic experimental results show that the proposed model is useful for a document classification task when compared with the baseline model. We qualitatively show that the latent coordinates obtained by training the proposed model are better than those of the baseline model.

#16 Semantics of the Unwritten: The Effect of End of Paragraph and Sequence Tokens on Text Generation with GPT2 [PDF] [Copy] [Kimi1]

Authors: He Bai ; Peng Shi ; Jimmy Lin ; Luchen Tan ; Kun Xiong ; Wen Gao ; Jie Liu ; Ming Li

The semantics of a text is manifested not only by what is read but also by what is not read. In this article, we will study how those implicit “not read” information such as end-of-paragraph () and end-of-sequence () affect the quality of text generation. Specifically, we find that the pre-trained language model GPT2 can generate better continuations by learning to generate the in the fine-tuning stage. Experimental results on English story generation show that can lead to higher BLEU scores and lower perplexity. We also conduct experiments on a self-collected Chinese essay dataset with Chinese-GPT2, a character level LM without and during pre-training. Experimental results show that the Chinese GPT2 can generate better essay endings with .

#17 Data Augmentation with Unsupervised Machine Translation Improves the Structural Similarity of Cross-lingual Word Embeddings [PDF] [Copy] [Kimi1]

Authors: Sosuke Nishikawa ; Ryokan Ri ; Yoshimasa Tsuruoka

Unsupervised cross-lingual word embedding(CLWE) methods learn a linear transformation matrix that maps two monolingual embedding spaces that are separately trained with monolingual corpora. This method relies on the assumption that the two embedding spaces are structurally similar, which does not necessarily hold true in general. In this paper, we argue that using a pseudo-parallel corpus generated by an unsupervised machine translation model facilitates the structural similarity of the two embedding spaces and improves the quality of CLWEs in the unsupervised mapping method. We show that our approach outperforms other alternative approaches given the same amount of data, and, through detailed analysis, we show that data augmentation with the pseudo data from unsupervised machine translation is especially effective for mapping-based CLWEs because (1) the pseudo data makes the source and target corpora (partially) parallel; (2) the pseudo data contains information on the original language that helps to learn similar embedding spaces between the source and target languages.

#18 Joint Detection and Coreference Resolution of Entities and Events with Document-level Context Aggregation [PDF] [Copy] [Kimi1]

Authors: Samuel Kriman ; Heng Ji

Constructing knowledge graphs from unstructured text is an important task that is relevant to many domains. Most previous work focuses on extracting information from sentences or paragraphs, due to the difficulty of analyzing longer contexts. In this paper we propose a new jointly trained model that can be used for various information extraction tasks at the document level. The tasks performed by this system are entity and event identification, typing, and coreference resolution. In order to improve entity and event typing, we utilize context-aware representations aggregated from the detected mentions of the corresponding entities and events across the entire document. By extending our system to document-level, we can improve our results by incorporating cross-sentence dependencies and additional contextual information that might not be available at the sentence level, which allows for more globally optimized predictions. We evaluate our system on documents from the ACE05-E+ dataset and find significant improvement over the sentence-level SOTA on entity and event trigger identification and classification.

#19 “Hold on honey, men at work”: A semi-supervised approach to detecting sexism in sitcoms [PDF] [Copy] [Kimi1]

Authors: Smriti Singh ; Tanvi Anand ; Arijit Ghosh Chowdhury ; Zeerak Waseem

Television shows play an important role inpropagating societal norms. Owing to the popularity of the situational comedy (sitcom) genre, it contributes significantly to the over-all development of society. In an effort to analyze the content of television shows belong-ing to this genre, we present a dataset of dialogue turns from popular sitcoms annotated for the presence of sexist remarks. We train a text classification model to detect sexism using domain adaptive learning. We apply the model to our dataset to analyze the evolution of sexist content over the years. We propose a domain-specific semi-supervised architecture for the aforementioned detection of sexism. Through extensive experiments, we show that our model often yields better classification performance over generic deep learn-ing based sentence classification that does not employ domain-specific training. We find that while sexism decreases over time on average,the proportion of sexist dialogue for the most sexist sitcom actually increases. A quantitative analysis along with a detailed error analysis presents the case for our proposed methodology

#20 Observing the Learning Curve of NMT Systems With Regard to Linguistic Phenomena [PDF] [Copy] [Kimi1]

Authors: Patrick Stadler ; Vivien Macketanz ; Eleftherios Avramidis

In this paper we present our observations and evaluations by observing the linguistic performance of the system on several steps on the training process of various English-to-German Neural Machine Translation models. The linguistic performance is measured through a semi-automatic process using a test suite. Among several linguistic observations, we find that the translation quality of some linguistic categories decreased within the recorded iterations. Additionally, we notice some drops of the translation quality of certain categories when using a larger corpus.

#21 Improving the Robustness of QA Models to Challenge Sets with Variational Question-Answer Pair Generation [PDF] [Copy] [Kimi1]

Authors: Kazutoshi Shinoda ; Saku Sugawara ; Akiko Aizawa

Question answering (QA) models for reading comprehension have achieved human-level accuracy on in-distribution test sets. However, they have been demonstrated to lack robustness to challenge sets, whose distribution is different from that of training sets. Existing data augmentation methods mitigate this problem by simply augmenting training sets with synthetic examples sampled from the same distribution as the challenge sets. However, these methods assume that the distribution of a challenge set is known a priori, making them less applicable to unseen challenge sets. In this study, we focus on question-answer pair generation (QAG) to mitigate this problem. While most existing QAG methods aim to improve the quality of synthetic examples, we conjecture that diversity-promoting QAG can mitigate the sparsity of training sets and lead to better robustness. We present a variational QAG model that generates multiple diverse QA pairs from a paragraph. Our experiments show that our method can improve the accuracy of 12 challenge sets, as well as the in-distribution accuracy.

#22 Tools Impact on the Quality of Annotations for Chat Untangling [PDF] [Copy] [Kimi1]

Authors: Jhonny Cerezo ; Felipe Bravo-Marquez ; Alexandre Henri Bergel

The quality of the annotated data directly influences in the success of supervised NLP models. However, creating annotated datasets is often time-consuming and expensive. Although the annotation tool takes an important role, we know little about how it influences annotation quality. We compare the quality of annotations for the task of chat-untangling made by non-experts annotators using two different tools. The first is SLATE, an existing command-line based tool, and the second is Parlay, a new tool we developed that integrates mouse interaction and visual links. Our experimental results indicate that, while both tools perform similarly in terms of annotation quality, Parlay offers a significantly better user experience.

#23 How Many Layers and Why? An Analysis of the Model Depth in Transformers [PDF] [Copy] [Kimi1]

Authors: Antoine Simoulin ; Benoit Crabbé

In this study, we investigate the role of the multiple layers in deep transformer models. We design a variant of Albert that dynamically adapts the number of layers for each token of the input. The key specificity of Albert is that weights are tied across layers. Therefore, the stack of encoder layers iteratively repeats the application of the same transformation function on the input. We interpret the repetition of this application as an iterative process where the token contextualized representations are progressively refined. We analyze this process at the token level during pre-training, fine-tuning, and inference. We show that tokens do not require the same amount of iterations and that difficult or crucial tokens for the task are subject to more iterations.

#24 Edit Distance Based Curriculum Learning for Paraphrase Generation [PDF] [Copy] [Kimi1]

Authors: Sora Kadotani ; Tomoyuki Kajiwara ; Yuki Arase ; Makoto Onizuka

Curriculum learning has improved the quality of neural machine translation, where only source-side features are considered in the metrics to determine the difficulty of translation. In this study, we apply curriculum learning to paraphrase generation for the first time. Different from machine translation, paraphrase generation allows a certain level of discrepancy in semantics between source and target, which results in diverse transformations from lexical substitution to reordering of clauses. Hence, the difficulty of transformations requires considering both source and target contexts. Experiments on formality transfer using GYAFC showed that our curriculum learning with edit distance improves the quality of paraphrase generation. Additionally, the proposed method improves the quality of difficult samples, which was not possible for previous methods.

#25 Changing the Basis of Contextual Representations with Explicit Semantics [PDF] [Copy] [Kimi1]

Authors: Tamás Ficsor ; Gábor Berend

The application of transformer-based contextual representations has became a de facto solution for solving complex NLP tasks. Despite their successes, such representations are arguably opaque as their latent dimensions are not directly interpretable. To alleviate this limitation of contextual representations, we devise such an algorithm where the output representation expresses human-interpretable information of each dimension. We achieve this by constructing a transformation matrix based on the semantic content of the embedding space and predefined semantic categories using Hellinger distance. We evaluate our inferred representations on supersense prediction task. Our experiments reveal that the interpretable nature of transformed contextual representations makes it possible to accurately predict the supersense category of a word by simply looking for its transformed coordinate with the largest coefficient. We quantify the effects of our proposed transformation when applied over traditional dense contextual embeddings. We additionally investigate and report consistent improvements for the integration of sparse contextual word representations into our proposed algorithm.